Svdquant huggingface checkpoint export support #754
base: main
Conversation
Force-pushed from a788b53 to 34e75e5
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #754   +/-   ##
=======================================
  Coverage   74.22%   74.23%
=======================================
  Files         192      192
  Lines       19035    19038     +3
=======================================
+ Hits        14129    14132     +3
  Misses       4906     4906
```
jingyu-ml left a comment:
LGTM overall, including the approach for fusing the QKV and FFN layers. The current resmooth + refusion process means the resulting model is not exactly identical to the original, but this appears to be the only viable option at the moment unless we can fuse these layers during calibration...
Thank you for your work!
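Sketched below is the resmoothing idea this comment refers to: before QKV/FFN layers can be fused, their per-channel input scales must agree, so the scales are averaged and the change is folded back into the weights. This is a minimal sketch assuming AWQ-style `pre_quant_scale` buffers on each input quantizer; `resmooth_for_fusion` is an illustrative name, not the actual modelopt API.

```python
import torch

def resmooth_for_fusion(modules: list[torch.nn.Linear]) -> None:
    # Fused layers must share one input smoothing scale, so average them.
    avg_scale = torch.stack(
        [m.input_quantizer.pre_quant_scale for m in modules]
    ).mean(dim=0)
    for m in modules:
        old_scale = m.input_quantizer.pre_quant_scale
        # Assuming the quantizer multiplies inputs by pre_quant_scale,
        # the output stays unchanged in exact arithmetic when the weight's
        # input columns absorb the ratio old / avg.
        m.weight.data *= (old_scale / avg_scale).to(m.weight.dtype)
        m.input_quantizer.pre_quant_scale = avg_scale
```

Because the rescaled weights are subsequently re-quantized, the fused model is close to, but not bit-identical with, the original — exactly the trade-off the comment accepts.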
```python
def svd(weight, rank):
    original_device = weight.device
    original_dtype = weight.dtype
    weight_f64 = weight.to(dtype=torch.float64, device=original_device)
```
do we need f64?
I am not sure. I kept what @jingyu-ml had originally. This part is just a refactoring so that I can reuse this code during QKV fusion.
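For context, here is a hedged sketch of how the refactored helper plausibly continues from the excerpt above; the rank-truncation details and return values are assumptions, not the exact implementation under review.

```python
import torch

def svd(weight: torch.Tensor, rank: int):
    original_device = weight.device
    original_dtype = weight.dtype
    # float64 improves numerical stability of the decomposition; whether it
    # is strictly required is the open question in this thread.
    weight_f64 = weight.to(dtype=torch.float64, device=original_device)
    u, s, vt = torch.linalg.svd(weight_f64, full_matrices=False)
    # Keep the top-`rank` singular components as the low-rank LoRA branch.
    lora_a = (torch.diag(s[:rank]) @ vt[:rank]).to(dtype=original_dtype)
    lora_b = u[:, :rank].to(dtype=original_dtype)
    return lora_a, lora_b
```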
meenchen left a comment:
LGTM overall
modelopt/torch/export/quant_utils.py (outdated)
```python
for module in modules:
    if not torch.equal(module.input_quantizer.pre_quant_scale, avg_prequant_scale):
        _update_pre_quant_scale(module, avg_prequant_scale)
if hasattr(modules[0].weight_quantizer, "svdquant_lora_a"):
```
Can svdquant_lora_a be None in any case?
Good point. I will skip it when it is None.
Do we have unit tests for SVD quant?
I think we have unit tests for svdquant, but not for this export part.
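To make the gap concrete, here is a sketch of the kind of export-side unit test that could cover this, exercising the detection logic with dummy quantizer objects instead of a full calibrated model. `_is_nvfp4_svdquant` is a hypothetical helper mirroring the value-check fix proposed below, not an existing modelopt function.

```python
import types

import torch

def _is_nvfp4_svdquant(weight_quantizer, input_quantizer) -> bool:
    return (
        input_quantizer is not None
        and getattr(weight_quantizer, "svdquant_lora_a", None) is not None
        and getattr(weight_quantizer, "svdquant_lora_b", None) is not None
    )

def test_svdquant_detection():
    iq = object()
    # A quantizer with populated LoRA buffers should be detected as SVDQuant.
    wq = types.SimpleNamespace(
        svdquant_lora_a=torch.zeros(4, 8),
        svdquant_lora_b=torch.zeros(8, 4),
    )
    assert _is_nvfp4_svdquant(wq, iq)
    # Registered-but-unset buffers must not be misclassified.
    wq_unset = types.SimpleNamespace(svdquant_lora_a=None, svdquant_lora_b=None)
    assert not _is_nvfp4_svdquant(wq_unset, iq)
```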
Signed-off-by: Shiyang Chen <[email protected]>
📝 Walkthrough

This change introduces support for NVFP4 SVDQUANT (SVD-based quantization) throughout the modelopt export pipeline. It adds configuration options, defines a new quantization constant, extends quantization utilities to recognize and process SVD quantization, adds an SVD computation helper, and updates export logic to handle this quantization type consistently with existing variants.
Actionable comments posted: 1
♻️ Duplicate comments (1)

modelopt/torch/export/quant_utils.py (1)

Lines 498-500: Fix SVDQuant detection to check the actual LoRA buffers (not just attribute presence). `hasattr(weight_quantizer, "svdquant_lora_a")` is true even when the buffer is unset, so non-SVD NVFP4 quantizers could be misclassified as SVDQUANT. Use a value check (and include `svdquant_lora_b`) to avoid false positives.

🛠️ Proposed fix

```diff
-    if input_quantizer is not None and hasattr(weight_quantizer, "svdquant_lora_a"):
-        return QUANTIZATION_NVFP4_SVDQUANT
+    if (
+        input_quantizer is not None
+        and getattr(weight_quantizer, "svdquant_lora_a", None) is not None
+        and getattr(weight_quantizer, "svdquant_lora_b", None) is not None
+    ):
+        return QUANTIZATION_NVFP4_SVDQUANT
```
```python
if (
    args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1
) and args.qformat == "nvfp4_svdquant":
    raise NotImplementedError("Svdquant does not support mulitple GPUs yet.")
```
Fix typo in error message.
"mulitple" should be "multiple".
✏️ Proposed fix

```diff
 if (
     args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1
 ) and args.qformat == "nvfp4_svdquant":
-    raise NotImplementedError("Svdquant does not support mulitple GPUs yet.")
+    raise NotImplementedError("Svdquant does not support multiple GPUs yet.")
```
🤖 Prompt for AI Agents

In `@examples/llm_ptq/hf_ptq.py` around lines 510-513, update the NotImplementedError message string in the conditional that checks args.inference_tensor_parallel, args.inference_pipeline_parallel, and args.qformat ("nvfp4_svdquant") to fix the typo: change "mulitple" to "multiple" so the raised message reads "Svdquant does not support multiple GPUs yet." Refer to the conditional using args.inference_tensor_parallel, args.inference_pipeline_parallel, args.qformat, and the raised NotImplementedError to locate the change.
What does this PR do?
Type of change: new feature
Overview: Adds Hugging Face checkpoint export support for NVFP4 SVDQuant quantized models.
Usage
```shell
cd ./examples/llm_ptq/
python hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-4B \
    --export_path /home/scratch.shiychen_coreai/quantized_models/Qwen3-4B-svdq \
    --qformat nvfp4_awq_svdquant \
    --kv_cache_qformat none \
    --sparsity_fmt dense \
    --calib_size 8
```

Testing

Exported the checkpoint and loaded it back.
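As a hedged illustration of that check, the following sketch loads the exported config and scans the safetensors shards. The path is the --export_path from the usage example above, and the assumption that the serialized SVD LoRA tensors carry "lora" in their names is illustrative only.

```python
import glob
import json
import os

from safetensors import safe_open

export_path = "/home/scratch.shiychen_coreai/quantized_models/Qwen3-4B-svdq"

# The exported config should carry the quantization metadata.
with open(os.path.join(export_path, "config.json")) as f:
    config = json.load(f)
print(config.get("quantization_config"))

# Scan each shard for the low-rank tensors added by SVDQuant.
for shard in sorted(glob.glob(os.path.join(export_path, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        lora_keys = [k for k in f.keys() if "lora" in k]
    print(os.path.basename(shard), f"{len(lora_keys)} LoRA tensors")
```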
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit

Release Notes

New Features
- Hugging Face checkpoint export support for NVFP4 SVDQuant quantized models.

Limitations
- SVDQuant export does not support multiple GPUs yet.